Đồ án cuối khóa Snowflake Bootcamp: Xây dựng một hệ thống phân tích dữ liệu kinh doanh hoàn chỉnh sử dụng bộ dữ liệu TPC-H - một benchmark chuẩn quốc tế cho data warehouses.
Dự án này triển khai một data warehouse hiện đại với:
Các bảng chính:
Mối quan hệ giữa các bảng:
erDiagram REGION ||--o{ NATION : "1:N (R_REGIONKEY)" NATION ||--o{ CUSTOMER : "1:N (N_NATIONKEY)" NATION ||--o{ SUPPLIER : "1:N (N_NATIONKEY)" CUSTOMER ||--o{ ORDERS : "1:N (C_CUSTKEY)" PART ||--o{ PARTSUPP : "1:N (P_PARTKEY)" SUPPLIER ||--o{ PARTSUPP : "1:N (S_SUPPKEY)" ORDERS ||--o{ LINEITEM : "1:N (O_ORDERKEY)" PART ||--o{ LINEITEM : "1:N (L_PARTKEY)" SUPPLIER ||--o{ LINEITEM : "1:N (S_SUPPKEY)" REGION { int R_REGIONKEY PK char R_NAME varchar R_COMMENT } NATION { int N_NATIONKEY PK char N_NAME int N_REGIONKEY FK varchar N_COMMENT } CUSTOMER { int C_CUSTKEY PK varchar C_NAME int C_NATIONKEY FK decimal C_ACCTBAL } SUPPLIER { int S_SUPPKEY PK char S_NAME int S_NATIONKEY FK decimal S_ACCTBAL } PART { int P_PARTKEY PK varchar P_NAME varchar P_BRAND decimal P_RETAILPRICE } PARTSUPP { int PS_PARTKEY PK, FK int PS_SUPPKEY PK, FK int PS_AVAILQTY decimal PS_SUPPLYCOST } ORDERS { int O_ORDERKEY PK int O_CUSTKEY FK char O_ORDERSTATUS date O_ORDERDATE } LINEITEM { int L_ORDERKEY PK, FK int L_PARTKEY FK int L_SUPPKEY FK int L_LINENUMBER PK decimal L_QUANTITY date L_SHIPDATE }
graph TD %% Nodes Source[(Snowflake Data Sources<br/>TPCH_SF1)] subgraph Bronze [BRONZE LAYER - Raw Data] direction TB B_Table[Schema: STAGING] B_Desc[• Direct copy<br/>• No transformations<br/>• Metadata added] end subgraph Silver [SILVER LAYER - Cleaned & Enriched] direction TB S_Table[Schema: ANALYTICS] S_Desc[• Cleaning & Validation<br/>• Deduplication<br/>• Business Logic] end subgraph Gold [GOLD LAYER - Business Metrics] direction TB G_Table[Schema: REPORTS] G_Desc[• Aggregated Metrics<br/>• KPIs & Dashboards<br/>• Business Insights] end %% Processes CDC{STREAMS<br/>CDC} Task1[[TASKS<br/>Transformation]] Task2[[TASKS<br/>Aggregations]] %% Flow Source -->|Copy Into| B_Table B_Table --> CDC CDC --> Task1 Task1 --> S_Table S_Table --> Task2 Task2 --> G_Table %% Styling style Source fill:#1e293b,stroke:#334155,stroke-width:2px,color:#fff style Bronze fill:#cd7f32,stroke:#fff,stroke-width:1px,color:#fff style Silver fill:#94a3b8,stroke:#fff,stroke-width:1px,color:#fff style Gold fill:#fbbf24,stroke:#fff,stroke-width:1px,color:#333 style CDC fill:#fff9c4,stroke:#fbc02d,stroke-dasharray: 5 5 style Task1 fill:#e1f5fe,stroke:#039be5 style Task2 fill:#e1f5fe,stroke:#039be5
graph LR %% Nodes definition Data[Files/Data] Bronze[Bronze<br/>Staging] Stream{{STREAM}} Task1[[TASK]] Silver[Silver<br/>Analytics] Task2[[TASK]] Gold[Gold<br/>Reports] %% Main Flow Data --> Bronze Bronze --> Stream Stream --> Task1 Task1 --> Silver Silver --> Task2 Task2 --> Gold %% Process Labels (Sub-labels) Bronze -.-> |Track Changes| Bronze Task1 -.-> |Transform & Clean| Task1 Task2 -.-> |Generate KPIs| Task2 %% Styling style Data fill:#f8fafc,stroke:#cbd5e1,stroke-width:2px style Bronze fill:#fff7ed,stroke:#fb923c,stroke-width:2px style Stream fill:#fefce8,stroke:#facc15,stroke-width:2px style Task1 fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px style Silver fill:#f1f5f9,stroke:#64748b,stroke-width:2px style Task2 fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px style Gold fill:#fffbeb,stroke:#f59e0b,stroke-width:2px %% Sub-label Styling linkStyle 6 stroke:#6366f1,stroke-width:1px,stroke-dasharray: 3 linkStyle 7 stroke:#6366f1,stroke-width:1px,stroke-dasharray: 3 linkStyle 8 stroke:#6366f1,stroke-width:1px,stroke-dasharray: 3
git clone <repository-url>
cd tpch_analytics_project
Chạy script sau với role ACCOUNTADMIN:
snowsql -f src/01_database_stage_roles.sql
Hoặc trong Snowsight UI:
src/01_database_stage_roles.sqlScript này sẽ:
Chạy script Medallion Architecture:
snowsql -f src/02_medallion_data_pipeline_automation.sql
Script này sẽ:
snowsql -f src/03_data_quality_check.sql
Script này thực hiện:
snowsql -f src/04_masking_policies_secure_data_sharing.sql
Script này triển khai:
Cài đặt dependencies:
pip install snowflake-snowpark-python pandas
Cập nhật connection parameters trong src/05_snowpark.py:
connection_parameters = {
"account": "<your_account>",
"user": "<your_username>",
"password": "<your_password>",
...
}
Chạy script:
python src/05_snowpark.py
Script này tạo:
snowsql -f src/05_udfs.sql
Script này tạo:
tpch_analytics_project/
│
├── README.md # Tài liệu dự án
│
├── src/
│ ├── 01_database_stage_roles.sql # Setup database, roles, staging
│ ├── 02_medallion_data_pipeline_automation.sql # Medallion architecture & automation
│ ├── 03_data_quality_check.sql # Data profiling & quality checks
│ ├── 04_masking_policies_secure_data_sharing.sql # Security & data sharing
│ ├── 05_snowpark.py # Snowpark Python analytics
│ └── 05_udfs.sql # User-defined functions
│
└── bonus/
└── visualizations.pdf # Dashboard visualizations (optional)
| Role | Permissions | Use Case |
|---|---|---|
| TPCH_ADMIN | Full access to all objects | Database administration |
| TPCH_DEVELOPER | Create/modify objects, INSERT/UPDATE/DELETE | Data engineering, ETL development |
| TPCH_ANALYST | SELECT on all schemas | Business analysis, reporting |
| TPCH_VIEWER | SELECT on REPORTS schema only | Dashboard viewing, read-only access |
| Table | Records | Description |
|---|---|---|
| REGION | 5 | Khu vực địa lý |
| NATION | 25 | Quốc gia |
| CUSTOMER | 150,000 | Khách hàng |
| SUPPLIER | 10,000 | Nhà cung cấp |
| PART | 200,000 | Sản phẩm |
| PARTSUPP | 800,000 | Quan hệ sản phẩm-nhà cung cấp |
| ORDERS | 1,500,000 | Đơn hàng |
| LINEITEM | 6,001,215 | Chi tiết đơn hàng |
| Table | Description | Key Features |
|---|---|---|
| ORDERS_SILVER | Cleaned orders | Status descriptions, date components, priority ranks |
| CUSTOMER_SILVER | Enriched customers | Nation/region names, cleaned phone, balance categories |
| LINEITEM_SILVER | Processed line items | Calculated amounts, date components |
| PART_SILVER | Categorized products | Type categories, size categories, price ranges |
| SUPPLIER_SILVER | Enriched suppliers | Nation/region names, balance status |
| Table | Description | Key Metrics |
|---|---|---|
| MONTHLY_SALES_REPORT | Monthly sales KPIs | Revenue, orders, customers, growth rates |
| CUSTOMER_METRICS | RFM analysis | Recency, Frequency, Monetary, segments |
| PRODUCT_PERFORMANCE | Product analytics | Sales, revenue, rankings |
| REGIONAL_ANALYSIS | Regional performance | Regional revenue, market share |
| Table | Description |
|---|---|
| CUSTOMER_RFM_SCORES | RFM segmentation results |
| MONTHLY_SALES_TRENDS | Monthly sales with growth metrics |
| QUARTERLY_SALES_TRENDS | Quarterly aggregations |
| PRODUCT_ANALYSIS_RESULTS | Detailed product performance |
| REGIONAL_PERFORMANCE_ANALYSIS | Regional market analysis |
EMAIL_MASK: Progressive masking based on role
PHONE_MASK: Phone number protection
SSN_MASK: Social Security Number protection
BALANCE_MASK: Financial data protection
CREDIT_CARD_MASK: Credit card protection (PCI compliance)
INCOME_MASK: Income data protection
CLASSIFY_CUSTOMER_BY_REVENUE(revenue) → Customer tier (VIP, GOLD, SILVER, BRONZE, STANDARD)VALIDATE_PHONE_NUMBER(phone) → Boolean validationVALIDATE_EMAIL(email) → Boolean validationGET_DISCOUNT_TIER(discount) → Discount categoryFORMAT_CURRENCY(amount) → Formatted currency stringGET_PRIORITY_SCORE(priority, price) → Priority scoreCALCULATE_SHIPPING_DELAY(ship, commit, receipt) → Delay in daysGET_SEASON(date) → Season nameCALCULATE_CLV_SCORE(recency, frequency, monetary) → CLV scoreCATEGORIZE_PRODUCT_PRICE(price) → Price categoryCALCULATE_SATISFACTION_SCORE(delivery_rate, discount, frequency) → Satisfaction score (0-100)CLEAN_PHONE_NUMBER(phone) → Standardized phone formatCALCULATE_PROFITABILITY_INDEX(revenue, cost, quantity) → Profitability indexGENERATE_ENGAGEMENT_SCORE(days, orders, avg_value) → Engagement JSONGET_CUSTOMER_COHORTS() → Returns cohort analysis tableSegments:
Use Cases:
SP_TRANSFORM_ORDERS_TO_SILVER(): Transform orders with enrichmentSP_TRANSFORM_CUSTOMER_TO_SILVER(): Transform customers with enrichmentSP_TRANSFORM_LINEITEM_TO_SILVER(): Transform line items with calculationsSP_GENERATE_MONTHLY_SALES_REPORT(): Generate monthly KPIsSP_GENERATE_CUSTOMER_METRICS(): Generate RFM metricsSP_GENERATE_PRODUCT_PERFORMANCE(): Generate product metricsSP_GENERATE_REGIONAL_ANALYSIS(): Generate regional metrics✅ NULL value checks across all tables ✅ Duplicate detection in primary keys ✅ Referential integrity validation ✅ Business logic constraints ✅ Statistical outlier detection ✅ Data completeness reports
SELECT
C_CUSTKEY,
C_NAME,
C_NATION_NAME,
SUM(O_TOTALPRICE) AS TOTAL_REVENUE,
CLASSIFY_CUSTOMER_BY_REVENUE(SUM(O_TOTALPRICE)) AS TIER
FROM ANALYTICS.CUSTOMER_SILVER C
JOIN ANALYTICS.ORDERS_SILVER O ON C.C_CUSTKEY = O.O_CUSTKEY
GROUP BY C_CUSTKEY, C_NAME, C_NATION_NAME
ORDER BY TOTAL_REVENUE DESC
LIMIT 10;
SELECT
REPORT_DATE,
YEAR,
MONTH_NAME,
TOTAL_ORDERS,
FORMAT_CURRENCY(TOTAL_REVENUE) AS REVENUE,
MOM_REVENUE_GROWTH
FROM REPORTS.MONTHLY_SALES_REPORT
ORDER BY REPORT_DATE DESC;
SELECT
RFM_SEGMENT,
COUNT(*) AS CUSTOMER_COUNT,
AVG(LIFETIME_VALUE) AS AVG_LTV,
AVG(FREQUENCY) AS AVG_ORDERS
FROM REPORTS.CUSTOMER_METRICS
GROUP BY RFM_SEGMENT
ORDER BY AVG_LTV DESC;
-- For date-range queries on ORDERS
ALTER TABLE ORDERS CLUSTER BY (O_ORDERDATE);
-- For regional queries on CUSTOMER
ALTER TABLE CUSTOMER CLUSTER BY (C_NATIONKEY);
-- Pre-aggregate frequently accessed metrics
CREATE MATERIALIZED VIEW DAILY_SALES_MV AS
SELECT
O_ORDERDATE,
COUNT(*) AS ORDER_COUNT,
SUM(O_TOTALPRICE) AS TOTAL_REVENUE
FROM ORDERS
GROUP BY O_ORDERDATE;
Issue: Tasks not running
-- Check task status
SHOW TASKS IN DATABASE TPCH_ANALYTICS_DB;
-- Check task history
SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE NAME LIKE 'TASK_%'
ORDER BY SCHEDULED_TIME DESC;
-- Resume suspended tasks
ALTER TASK TASK_TRANSFORM_ORDERS_TO_SILVER RESUME;
Issue: Stream has no data
-- Check stream status
SHOW STREAMS;
-- Check if stream has data
SELECT SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM');
-- Insert test data to trigger stream
INSERT INTO STAGING.ORDERS (...)
VALUES (...);
Issue: Insufficient permissions
-- Grant missing permissions
GRANT SELECT ON ALL TABLES IN SCHEMA STAGING TO ROLE TPCH_ANALYST;
GRANT USAGE ON WAREHOUSE TPCH_WH TO ROLE TPCH_VIEWER;
Dự án này là đồ án cuối khóa Snowflake Bootcamp. Mọi đóng góp, góp ý đều được hoan nghênh!
This project is for educational purposes as part of Snowflake Bootcamp.
🎉 Chúc bạn thành công!